Language/OS - Multiplatform Resource Library

home *** CD-ROM | disk | FTP | other *** search

/ Language/OS - Multiplatform Resource Library / LANGUAGE OS.iso / cocktail / doc.lha / doc.doc / rex.doc < prev next >

Wrap

Text File | 1992-09-25 | 91KB | 2,839 lines

___________________________________________________________________ Rex A Scanner Generator J. Grosch ___________________________________________________________________ ___________________________________________________________________ GESELLSCHAFT FUeR MATHEMATIK UND DATENVERARBEITUNG MBH FORSCHUNGSSTELLE FUeR PROGRAMMSTRUKTUREN AN DER UNIVERSITAeT KARLSRUHE ___________________________________________________________________ Project Compiler Generation ___________________________________________________________ Rex - A Scanner Generator Josef Grosch July 31, 1992 ___________________________________________________________ Report No. 5 Copyright c 1992 GMD Gesellschaft fuer Mathematik und Datenverarbeitung mbH Forschungsstelle an der Universitaet Karlsruhe Vincenz-Priesznitz-Str. 1 D-7500 Karlsruhe Rex 1 1. Introduction Rex generates programs to be used in lexical analysis of text. A typical application is the generation of scanners for compilers. Rex stands for Regu- lar EXpression tool. In principle it is a remake of LEX [Les75]. Rex processes a specification containing regular expressions to be searched for, and actions written in C or Modula-2 to be executed when expres- sions are found. Unrecognized portions of the input are copied by default to standard output. Rex generates a table-driven scanner consisting of a scanner routine and control tables. The scanner routine implements a tunnel automaton [Gro89] and contains a copy of the specified actions. The scanners generated by Rex are 5 times faster and up to 5 times smaller than those generated by LEX. It is possible to reach a speed of 180,000 to 195,000 lines per minute on a MC 68020 processor (including input from file). If hashing of identifiers is performed additionally the speed is between 125,000 and 150,000 lines per minute. The generator Rex itself is 10 to 20 times faster than LEX in typical cases. Like LEX, Rex has all the features necessary to scan contemporary languages: that is the left and the right context can be taken into account to identify a token. The left context is handled by so-called start states and the right context by additional regu- lar expressions. The source coordinates (line and column number) of recog- nized words are calculated automatically. Scanners can be generated in the languages C and Modula-2. Rex itself is implemented in Modula-2. The following chapters constitute the user manual of Rex. Chapter 2 gives an overview of the operation of Rex and how its output is to be integrated in e. g. compilers. Chapter 3 describes the specification language. Chapter 4 summarizes the predefined items of the specification language. Chapter 5 contains the specification of the interface of the gen- erated scanners. Chapter 6 shows how to invoke and use Rex. Chapter 7 con- tains some details of the implementation. Chapter 8 describes the differences between Rex and LEX for those already familiar with LEX. The appendices con- tain a grammar for the input language and some examples. 2. Overview Figure 1 gives an overview of the observable behaviour of Rex. It takes as input a specification of a lexical analyser written in the language described in the next chapter. The output is the source text of a scanner and scanner tables (in case of Modula-2). The source text consists of a specifica- tion and a body part. These parts are files with the suffixes 'h' and 'c' if C is the target language. In the case of Modula-2 the parts are a definition and an implementation module. The scanner requires a source module to get blocks of characters e. g. by input from file. Rex can be asked to provide a proto- type source module which performs input from the UNIX standard input file. Additionally Rex can be asked to provide a main program to serve as test Fig. 1: Rex Overview Rex 2 driver of the scanner. This main program calls the scanner routine until the end of the input is reached. The above mentioned source programs constitute the minimum configuration to run the generated scanner. What is happening after the compilation of the program modules is shown in the "run time" half of Fig. 1. During initializa- tion the scanner reads its tables from a file (in case of Modula-2 - C uses initialized arrays). Then the scanner driver starts calling the scanner rou- tine which in turn sometimes calls the source module routines to get charac- ters. The data flow is in the opposite direction. The source module returns blocks of characters to the scanner. The scanner analyzes the character stream, executes the associated actions upon finding character sequences matched by regular expressions, and eventually returns tokens to the scanner driver. In general the scanner driver can be replaced by any other main pro- gram or subroutine like e. g. a parser. 3. Specification Language The input of Rex consists of 3 parts: - code written in the target language to be copied unchanged to the output (see 3.7.) - definitions of named regular expressions and start states (see 3.4. + 3.5.) - a set of regular expressions with associated actions written in the target language (see 3.2.) The first two parts are optional. We discuss the three parts in reverse order after introducing some lexical conventions. 3.1. Lexical Conventions The specification can be written in unformatted manner. That means white space in the form of blanks, tab characters, and newline characters has no meaning except to separate other items. Comments are written in the style of C: text included in '/*' and '*/' is ignored. Comments may not be nested. The specification uses a few keywords which should be escaped if needed as iden- tifiers (see below): BEGIN CLOSE DEFAULT DEFINE EOF EXPORT GLOBAL LOCAL NOT RULE RULES SCANNER START The following special characters are used as operators, delimiters, or escape characters: = . , : :- " # + - * / | ? ( ) [ ] { } < > \ Besides keywords and the above special characters a scanner specification is composed of characters, numbers, identifiers, strings, and actions. A character denotes itself. Special characters have to be escaped using a preceding escape character. The escape character is a backslash: '\'. For certain non-graphic characters the same escape sequences as in C are possible: Rex 3 newline NL \n horizontal tab HT \t vertical tab VT \v backspace BS \b carriage return CR \r form feed FF \f Other unprintable characters are represented by the escape character followed by an integer decimal number giving the internal coding. ; \+ \\ \n \10 Numbers denote numerical integer values. They consist of a sequence of digits. 8 12 0 Identifiers are used to refer to named entities. They consist of a letter followed by letters, digits, or underscore characters '_'. Lower case as well as upper case letters are possible. If an identifier is not defined its char- acter sequence is treated as a string. Identifiers that are keywords have to be escaped by a preceding escape character. letter HexDigit under_score \BEGIN END Strings denote a sequence of characters. They consist of a sequence of characters enclosed in double quotes '"'. It is not possible to include a dou- ble quote or an newline character into a string. No escape is needed within strings. It is a shorthand for escaping a whole sequence of characters. "BEGIN" ":=" "\" Actions are statements to be copied unchanged into the generated code. The statements have to be written in the desired target language. The actions have to be enclosed in braces '{' '}'. The characters '{' and '}' can be used within the actions as long as they are either properly nested or contained in strings or in character constants. Otherwise they have to be escaped by a backslash character '\'. The escape character '\' has to be escaped by itself if it is used outside of strings or character constants: '\\'. In general, a backslash character `\` can be used to escape any character outside of strings or character constants. Within those tokens the escape conventions are dis- abled and the tokens are left unchanged. There are additionally statements available to aid in scanning (see section 4.4.). Rex 4 { printf ("BEGIN recognized\n"); } { return SymBegin; } { if (level > 0) { GetWord (String); Concatenate (Word, String); } } { printf ("} recognized\n"); } 3.2. Regular Expressions In general the specification of a scanner consists of the keyword RULE or RULES followed by a list of regular expressions each one associated with an action. RULE BEGIN : { printf ("BEGIN recognized"); } END : { printf ("END recognized"); } ; : { printf ("; recognized"); } The scanner generated from the above example specification would print an appropriate message upon finding one of the character sequences 'BEGIN', 'END', or ';' in the input whenever they appear. We say a character sequence and a regular expression match if the character sequence has a structure according to the regular expression. In general the input of the scanner is searched for character sequences which match one of the specified regular expressions and the associated action is executed. Input characters which are not matched by any regular expression are copied by default to standard output. The syntax to write regular expressions is as follows (see Appendix 1 for a complete definition of the syntax). The productions are given in increasing precedence: Reg_Expr: Reg_Expr '|' Reg_Expr | Reg_Expr Reg_Expr | Reg_Expr '+' | Reg_Expr '*' | Reg_Expr '?' | Reg_Expr '[' Number ']' | Reg_Expr '[' Number '-' Number ']' | '(' Reg_Expr ')' | Character_Set | Character | Identifier | String . - A character is matched by a single identical character. Rex 5 a matches the character 'a' \t matches a tab character \n matches a newline character \10 matches a newline character (only if ASCII is used) \\ matches the character '\' - A string is matched by a character sequence identical to the characters that make up the string. ":=" matches the character sequence ':=' "\" matches the character '\' - An identifier may be defined to refer to a regular expression. In this case it matches the same characters as the regular expression. An unde- fined identifier is treated like a string, it matches its own character sequence. END matches the character sequence 'END' \NOT matches the character sequence 'NOT' - A number is treated like a string, it matches its own character sequence. 007 matches the character sequence '007' - A character set matches one arbitrary character contained in the set. It is written as a sequence of characters enclosed in braces. Ranges may be used to include intervals of characters. The same escapes as described for characters may be used. Unprintable characters and the following ones have to be escaped within character sets: '-' '}' ' ' '\' The predefined identifier ANY stands for a character set containing every character except the newline character. If a character set is preceded by the operator '-' it matches one arbitrary character except the ones contained in the set. { +\-*/ } matches the arithmetic operators + - * / { A-Z a-z 0-9 } matches all letters and digits - { \n } matches all characters except the newline character ANY matches all characters except the newline character - Two regular expressions separated by the operator '|' match characters that are matched by the first or by the second regular expressions. a | b matches the characters 'a' or 'b' Rex 6 - Two regular expressions following each other with no operator in between match the concatenation of character sequences matched by the single reg- ular expressions. a b matches the character sequence 'ab' - The operator '?' matches a character sequence matched by the preceding regular expression or the empty character sequence. In other words, the specified characters are optional. a b ? matches the character sequences 'a' and 'ab' - The operator '+' matches a character sequence which can be matched by the repetition of the preceding regular expression 1 or more times. a + matches the character sequences 'a', 'aa', 'aaa', ... - The operator '*' matches a character sequence which can be matched by the repetition of the preceding regular expression zero or more times. a b * matches the character sequences 'a', 'ab', 'abb', 'abbb', ... - A regular expression followed by a number in brackets matches a character sequence which can be matched by the repetition of the preceding regular expression exactly the times specified by the number. a [4] matches the character sequence 'aaaa' - A regular expression followed by a range in brackets matches a character sequence which can be matched by the repetition of the preceding regular expression a number of times lying in between of the two given numbers. a [2-4] matches the character sequences 'aa', 'aaa', and 'aaaa' - Parentheses '(' ')' can be used for grouping in more complex regular expressions. (a | b+)? (c d)*matches strings like 'acdcd', 'cdcdcd', 'bcd', or 'bbb'; but not 'ab', 'abb', or 'abcd'. A complete regular expression which is not part of any other regular expression is called a pattern. A pattern is matched exactly in the same way as regular expressions. It can be augmented by the following specifications. Rex 7 - A pattern preceded by the operator '<' matches a character sequence only if it appears at the beginning of a line. < {a-z} + matches identifiers only at the beginning of lines - A pattern followed by the operator '>' matches a character sequence only if it appears at the end of a line. " " + > matches trailing spaces < C ANY * > matches FORTRAN comment lines - A pattern followed by the operator '/' and a regular expression matches a character sequence only if it is followed by a character sequence that is matched by the regular expression behind the operator '/'. {0-9} + / ".." matches numbers, but only if followed by two dots - Several patterns that share a common action can be given in a comma separated list, thus the action has to be specified only once. ' - {\n'} * ', \" - {\n"} * \" matches both possible forms of Modula-2 strings 3.3. Ambiguous Specifications Rex can handle ambiguous specifications. When more than one expression can match the current input, Rex chooses as follows: - The longest match is preferred. - Among rules which match the same number of characters, the rule given first is preferred. The length of a match is the number of matched characters plus the number of characters matched by the regular expression following the "right context" operator '/' if applicable. Example: {0-9} + / ".." : { return SymDecimal; } {0-9} + "." {0-9} * : { return SymReal ; } ".." : { return SymRange ; } "." : { return SymDot ; } Suppose the right context of the first rule above is missing. The input 1.. Rex 8 would be recognized as SymReal and SymDot because SymReal matches two charac- ters. To get the right solution the right context is necessary. Now the input is recognized as SymDecimal and SymRange because SymDecimal matches 3 charac- ters. Example: BEGIN : { return SymBegin; } END : { return SymEnd ; } {A-Z} + : { return SymIdent; } The rules for keywords should be given before the rule for identifiers. Otherwise the keywords would be recognized as identifiers. 3.4. Definitions Regular expressions can be given names. This serves to avoid duplication of regular expressions or to increase the expressive power of a specification. After the keyword DEFINE a list of identifiers can be associated with regular expressions. Defined identifiers appearing within regular expressions are replaced by the regular expression given in the definition. Undefined iden- tifiers are treated as strings. The identifier ANY is predefined to match any character except newline. Example: letter = { A-Z a-z } . digit = { 0-9 } . string_character = - { " \n } . ANY = - { \n } . 3.5. Start States For complex tasks Rex offers a facility called "start states". Usually the generated scanner is always in the standard state called STD and all specified patterns are recognized. In general the scanner is allowed to change its state between an arbitrary number of user defined states. The patterns can be specified to be recognized only in certain states. Initially the scanner is in the standard start state STD. There are special statements to change the state of the scanner (see section 4.4.). They can be used in the actions of the rules. Start states have to be defined by giving a list of identifiers after the keyword START. The identifiers may be separated by commas. The standard state STD is predefined. - A pattern without given start states is recognized in every start state the scanner is in. - A pattern preceded by a list of start states (inclosed in '#' characters) is recognized only if the scanner is in one of the listed start states. Again the listed start states may be separated by commas. Rex 9 - A pattern preceded by the keyword NOT and a list of start states (enclosed in '#' characters) is recognized only if the scanner is in a start state not listed. Example: START comment RULE "(*" : {++ level; yyStart (comment);} #comment# "*)" : {-- level; if (level == 0) yyStart (STD);} #comment# "(" | "*" | - {*(} + : {} #STD# {0-9} + : {return SymNumber;} The above example shows how to handle nested comments in a Modula-2 scanner. The rule for opening comment brackets is recognized in all states. The nesting level is increased and we change the start state to comment with the predefined statement yyStart. Closing comment brackets are recognized only if the scanner is in start state comment. Upon their recognition the nesting level is decreased. Should the nesting level reach zero the comment is fin- ished and we change the state back to STD using yyStart again. While the scanner is in start state comment everything except opening and closing com- ment brackets is skipped by specifying an empty action. The last rule specify- ing the structure of decimal numbers is recognized only in the start state STD. The problem of how to declare the variable for counting the nesting level of comments is solved in section 3.7. 3.6. Scanner Name A specification may be optionally headed by a name for the scanner to be generated: Example: SCANNER lexer The identifier is used to derive the names of the scanner and source modules, the file name for the scanner tables, and a prefix for the objects exported by the scanner. If the name is missing it defaults to Scanner. In the following we refer to this name by <Scanner>. The prefixes <Scanner> and <Scanner>_ are generated only if this clause is present. Otherwise they are omitted in order to be compatible with former versions of Rex. 3.7. Target Code The actions associated with regular expressions may need variables or in general arbitrary declarations to perform their task. A scanner specification may be preceded by several kinds of sections written in the target language. The syntax rules for actions apply to these sections, too. These sections are copied unchanged and unchecked to the generated scanner at the following places: Rex 10 - Target code after the keyword EXPORT is included in the specification part (definition module) of the generated scanner. It allows to extend the set of objects exported by the scanner module. If not given it is predefined as described below. - Target code after the keyword GLOBAL is included in the scanner module at level 0, that is the extent of variables given in this section is the run time of the whole program. If not given it is predefined as described below. - Target code after the keyword LOCAL is included in the scanner routine called <Scanner>_GetToken (at level 1), that is the extent of variables given in this section is one invocation of this routine. - Target code after the keyword BEGIN is included in the routine <Scanner>_BeginScanner which may be called to initialize the data struc- tures declared in the sections EXPORT and GLOBAL. - Target code after the keyword CLOSE is included in the routine <Scanner>_CloseScanner which may be called after scanning is finished. This statements can be used to finalize the data structures declared in the sections EXPORT and GLOBAL. - Target code after the keyword DEFAULT is included in the scanner routine to be executed whenever a character is not matched by one of the regular expressions. It can be used to detect illegal characters for example. If not given it is predefined as described below. - Target code after the keyword EOF is included in the scanner routine to be executed upon reaching the end of the input. It can be used to return a value different from the predefined one (<Scanner>_EofToken = 0) or to check for unclosed comments or strings for example. If the EXPORT, GLOBAL, and DEFAULT sections are not used the following predefined declarations are included: Rex 11 If the target language is C: EXPORT { # include "Positions.h" typedef struct { tPosition Position; } <Scanner>_tScanAttribute; extern void <Scanner>_ErrorAttribute (int Token, <Scanner>_tScanAttribute * Attribute); } GLOBAL { void <Scanner>_ErrorAttribute (Token, Attribute) int Token; <Scanner>_tScanAttribute * Attribute; { } } DEFAULT { yyEcho; } If the target language is Modula-2: EXPORT { IMPORT Positions; TYPE tScanAttribute = RECORD Position: Positions.tPosition; END; PROCEDURE ErrorAttribute (Token: INTEGER; VAR Attribute: tScanAttribute); } GLOBAL { PROCEDURE ErrorAttribute (Token: INTEGER; VAR Attribute: tScanAttribute); BEGIN END ErrorAttribute; } DEFAULT { yyEcho; } These two sections import the type tPosition from a module named Posi- tions and they declare the type <Scanner>_tScanAttribute as well as the pro- cedure <Scanner>_ErrorAttribute. These items are needed in combination with parser generators. A variable called <Scanner>_Attribute of type <Scanner>_tScanAttribute is used to communicate additional properties of the tokens from the scanner to the parser. The type <Scanner>_tScanAttribute has to be a struct (record) type with at least one member (field) called Position of type tPosition. tPosition has to be a struct (record) type with at least two members (fields) called Line and Column (see section 3.8.). It can be imported from the predefined module Positions or from a user modified version of it. During automatic error repair a parser may insert tokens. In this case the parser calls the procedure <Scanner>_ErrorAttribute to ask for the addi- tional properties of an inserted token which is given by the parameter Token. The types tPosition and <Scanner>_tScanAttribute are predefined as given above Rex 12 and the procedure <Scanner>_ErrorAttribute is empty. If only one of the sec- tions EXPORT or GLOBAL is used, it has to contain declarations consistent with the remaining predefined ones. 3.8. Source Position The generated scanners automatically compute the line and column position of every token. This position can be accessed via the fields Position.Line and Position.Column of the global variable <Scanner>_Attribute as described in the section about the Scanner Interface. The source position is computed automat- ically if the action of a rule is preceded by a colon like in all the examples so far. However, if the character '-' is appended to the colon, the calcula- tion of the source position can be disabled for a rule. There are mainly two reasons for not to compute the position. First, some "compound" tokens have to be recognized by the combination of several rules (usually in connection with a start state). In order to get the correct posi- tion, which is the position yielded by the first rule, the calculation of the position has to be disabled for the following rules. Example (Pascal strings): START string RULE #STD# ' : {yyStart (string);} #string# - {'\t\n} + :- {} #string# '' :- {} #string# ' :- {yyStart (STD); return SymString;} Second, there is no need to calculate the source position in rules that skip input characters without returning a token. In this case disabling the computation of the position yields an increase in run time efficiency. The typical examples are comments. The example given in the chapter about Start States should be rewritten as follows: Example (Modula-2 comments): START comment RULE "(*" :- {++ level; yyStart (comment);} #comment# "*)" :- {-- level; if (level == 0) yyStart (STD);} #comment# "(" | "*" | - {*(\t\n} + :- {} 4. Predefined Items Rex knows several predefined items described in the next sections. 4.1. Definitions The identifier ANY is predefined to match one arbitrary character except newline. Rex 13 DEFINE ANY = - { \n } . 4.2. Start States The identifier STD is predefined to denote the standard start state of Rex. The generated scanners are initially in this state. START STD 4.3. Rules The 4 for rules given below are predefined after the user specified rules. By giving own rules the user can overwrite these because of the stra- tegy to solve ambiguities. The predefined rules help to calculate the line and column positions and to skip blanks efficiently. RULE " " :- {} \t :- {yyTab;} \n :- {yyEol (0);} ANY :- {yyEcho;} 4.4. Action Statements The following statements can be used within the actions associated with regular expressions: <Scanner>_GetWord (v); This statement gives access to the matched character sequence. In C the sequence is returned in the variable v which must be of type char v [ ]. Additionally the length of the sequence is returned as result of the function. In Modula-2 the sequence is returned in the variable v which must be of type Strings.tString. <Scanner>_GetLower (v); Like <Scanner>_GetWord, except that every letter is normalized to lower case. <Scanner>_GetUpper (v); Like <Scanner>_GetWord, except that every letter is normalized to upper case. yyEcho; The matched character sequence is printed on standard output. yyLess (n); The matched character sequence is truncated to the first n charac- ters. The other characters are rescanned for the next character sequence. Rex 14 yyStart (s); The start state is changed to state s. yyPrevious; The start state is changed to the state valid before the last execution of yyStart or yyPrevious. yyStartState This is not a statement but an expression of type short or SHORTCARD, respectively, whose value is the current start state. It can be used to execute different statements in one action depending on the current start state. yyTab; This statement should be used if a regular expression is specified by the user to process tab characters. Its purpose is to update the internal variable to calculate the column position of tokens. yyTab works only if the tab character exclusively is specified by a rule. yyTab1 (a); Like yyTab this statement should be used if a regular expression is specified by the user to process tab characters. Its purpose is to update the internal variable to calculate the column position of tokens. yyTab1 works if the tab character is embedded in other characters. The parameter a must specify the number of characters before the tab character. yyEol (n); This statement should be used if a regular expression is specified by the user to process newline characters. Its purpose is to up- date the internal variables to calculate the line and column posi- tion of tokens. yyEol should be executed once for every newline character matched. The parameter n should specify the number of characters matched after the last newline character. In simple cases where the pattern consists only of a newline character one invocation of yyEol (0); is sufficient. 5. Interface of the Generated Scanners The scanners generated by Rex offer an interface to be used by a main program like e. g. a parser and they require a source module for blocked input of characters to obey a certain interface. The structure of these two inter- faces is independent from a specific target language. As the syntactic details vary from one target language to another we discuss the interfaces in the fol- lowing target language specific chapters. 5.1. C It has been already mentioned that the prefixes <Scanner> and <Scanner>_ are generated only if the keyword SCANNER is present. Otherwise they are omit- ted in order to be compatible with former versions of Rex. 5.1.1. Scanner Interface The scanners generated by Rex offer an interface given by the following specification file named <Scanner>.h: Rex 15 # include "Positions.h" typedef struct { tPosition Position; } <Scanner>_tScanAttribute; extern void <Scanner>_ErrorAttribute (int Token, <Scanner>_tScanAttribute * Attribute); # define <Scanner>_EofToken 0 extern char * <Scanner>_TokenPtr ; extern short <Scanner>_TokenLength ; extern <Scanner>_tScanAttribute <Scanner>_Attribute; extern void (* <Scanner>_Exit) () ; extern void <Scanner>_BeginScanner (); extern void <Scanner>_BeginFile (char * FileName); extern int <Scanner>_GetToken (); extern int <Scanner>_GetWord (char * Word); extern int <Scanner>_GetLower (char * Word); extern int <Scanner>_GetUpper (char * Word); extern void <Scanner>_CloseFile (); extern void <Scanner>_CloseScanner (); - The procedure <Scanner>_GetToken is the central scanning routine. It re- turns the next token found in the input or whatever is specified in the actions associated with the regular expressions. - The procedure <Scanner>_BeginFile may be called to open an input file or a nested include file. It has one parameter of type 'char *' (string) which specifies the file name. Include files up to a nesting depth of 15 can be processed. If not called input is read from standard input. - The procedure <Scanner>_CloseFile may be called to close the actual input file (before reaching end of file). <Scanner>_CloseFile is called au- tomatically by the scanner upon reaching end of file. - The procedure <Scanner>_BeginScanner may be called to initialize user data. - The procedure <Scanner>_CloseScanner may be called to finalize user data. - The procedures <Scanner>_GetWord, <Scanner>_GetLower, and <Scanner>_GetUpper allow access to the matched character sequence as described in section 4.4. - Alternatively, the matched character sequence can be accessed using the variables <Scanner>_TokenPtr and <Scanner>_TokenLength. <Scanner>_TokenPtr points to the beginning of the matched character se- quence. <Scanner>_TokenLength specifies the number of the matched charac- ters. Note, the matched character sequence is not terminated by a '\0' character. - The variable <Scanner>_Attribute is supposed to communicate additional properties of the last token. The value must be provided by appropriate action statements. This variable is of type <Scanner>_tScanAttribute Rex 16 which has to be a struct type with at least one member called Position of type tPosition. tPosition has to be a struct type with at least two members called Line and Column. The values of Line and Column are comput- ed by the scanner, automatically. They indicate the source position of the actual token. The position of a token is the position of the first character of the token. For exceptions see section 3.8. The types <Scanner>_tScanAttribute and tPosition are predefined as given above. The definitions of these types can be changed as described in section 3.7. - During automatic error repair a parser may insert tokens. In this case the parser calls the procedure <Scanner>_ErrorAttribute to ask for the additional properties of an inserted token which is given by the parame- ter Token. The procedure should return in the second argument called <Scanner>_Attribute a default value for the additional properties of the token Token. - The variable <Scanner>_Exit refers to a procedure which is called upon an internal error in the scanner. The default procedure terminates the pro- gram execution. The variable can be changed to achieve a different behaviour. - If the scanner reaches the end of the input it returns the special token called <Scanner>_EofToken which is encoded by 0. 5.1.2. Source Interface The scanners generated by Rex need a source module for blocked input of characters. Rex can provide a prototype source module which reads from stan- dard input or any file. It is contained in the files <Scanner>Source.h and <Scanner>Source.c. The specification file <Scanner>Source.h consists of some- thing like: extern int <Scanner>_BeginSource (char * FileName); extern int <Scanner>_GetLine (int File, char * Buffer, int Size); extern void <Scanner>_CloseSource (int File); - <Scanner>_BeginSource is called from the scanner in order to open files or to initialize any other source of input. If not called input is read from standard input. - <Scanner>_GetLine is called to fill a buffer starting at address 'Buffer' with a block of maximal 'Size' characters. Lines are terminated by new- line characters (ASCII = 0xa). <Scanner>_GetLine returns the number of characters transferred. Reasonable block sizes are between 128 and 2048 or the length of a line. Smaller block sizes - especially block size 1 - will drastically slow down the scanner. - <Scanner>_CloseSource is called from the scanner at end of file respec- tively at end of input. It can be used to close files. Rex 17 The body in the file <Scanner>Source.c has the following contents: # include "<Scanner>Source.h" # include "System.h" int <Scanner>_BeginSource (FileName) char * FileName; { return OpenInput (FileName); } int <Scanner>_GetLine (File, Buffer, Size) int File; char * Buffer; int Size; { register int n = Read (File, Buffer, Size); # ifdef Dialog # define IgnoreChar ' ' /* Add dummy after newline character in order to supply a lookahead for rex. */ /* This way newline tokens are recognized without typing an extra line. */ if (n > 0 && Buffer [n - 1] == '0) Buffer [n ++] = IgnoreChar; # endif return n; } void <Scanner>_CloseSource (File) int File; { Close (File); } The newline character may constitute a token of its own in applications such as dialog programs. Like for every other token, Rex needs at least a look-ahead of one character to recognize this token. Therefore the user has to type not only one extra character but a complete extra input line because usu- ally input is line buffered by the operating system. This behaviour is un- desirable. The problem can be solved by compiling the file <Scanner>Source.c with the option -DDialog. This variant adds a dummy character after the new- line character to serve as lookahead. The dummy character should be a charac- ter that is ignored such as e. g. a blank. 5.1.3. Scanner Driver To test a generated scanner a main program is necessary. Rex can provide a minimal main program in the file <Scanner>Drv.c which can serve as test driver. It counts the tokens and looks like the following: Rex 18 # include "Positions.h" # include "<Scanner>.h" main () { int Token, Count = 0; char Word [256]; <Scanner>_BeginScanner (); do { Token = <Scanner>_GetToken (); Count ++; # ifdef Debug if (Token != <Scanner>_EofToken) (void) <Scanner>_GetWord (Word); else Word [0] = '\0'; WritePosition (stdout, <Scanner>_Attribute.Position); (void) printf ("%5d %s\n", Token, Word); # endif } while (Token != <Scanner>_EofToken); <Scanner>_CloseScanner (); (void) printf ("%d\n", Count); return 0; } 5.2. Modula-2 5.2.1. Scanner Interface The scanners generated by Rex offer an interface given by the following definition module named <Scanner>.md: DEFINITION MODULE <Scanner>; IMPORT Positions, Strings; TYPE tScanAttribute = RECORD Position: Positions.tPosition; END; PROCEDURE ErrorAttribute (Token: INTEGER; VAR Attribute: tScanAttribute); CONST EofToken = 0; VAR TokenLength : INTEGER; VAR Attribute : tScanAttribute; VAR ScanTabName : ARRAY [0 .. 127] OF CHAR; VAR Exit : PROC; PROCEDURE BeginScanner ; PROCEDURE BeginFile (FileName: ARRAY OF CHAR); PROCEDURE GetToken (): INTEGER; PROCEDURE GetWord (VAR Word: Strings.tString); PROCEDURE GetLower (VAR Word: Strings.tString); PROCEDURE GetUpper (VAR Word: Strings.tString); PROCEDURE CloseFile ; PROCEDURE CloseScanner ; END <Scanner>. Rex 19 - The procedure GetToken is the central scanning routine. It returns the next token found in the input or whatever is specified in the actions as- sociated with the regular expressions. - The array ScanTabName specifies the name of the file containing the scanner tables. It is initialized with the string "Scan.Tab". Therefore, the scanner tables are read by default from a file with this name in the current directory. If a different name or location is desired an arbi- trary path name can be assigned to this array before calling GetToken the first time. - The procedure BeginFile may be called to open an input file or a nested include file. The parameter FileName specifies the file name. Include files up to a nesting depth of 15 can be processed. If not called input is read from standard input. - The procedure CloseFile may be called to close the actual input file (be- fore reaching end of file). CloseFile is called automatically by the scanner upon reaching end of file. - The procedure BeginScanner may be called to initialize user data. - The procedure CloseScanner may be called to finalize user data. - The procedures GetWord, GetLower, and GetUpper allow access to the matched character sequence as described in section 4.4. - The variable TokenLength specifies the number of the matched characters. - The variable Attribute is supposed to communicate additional properties of the last token. The value must be provided by appropriate action statements. This variable is of type tScanAttribute which has to be a record type with at least one field called Position of type tPosition. tPosition has to be a record type with at least two fields called Line and Column. The values of Line and Column are computed by the scanner, automatically. They indicate the source position of the actual token. The position of a token is the position of the first character of the token. For exceptions see section 3.8. The types tScanAttribute and tPosition are predefined as given above. The definitions of these types can be changed as described in section 3.7. - During automatic error repair a parser may insert tokens. In this case the parser calls the procedure ErrorAttribute to ask for the additional properties of an inserted token which is given by the parameter Token. The procedure should return in the second argument called Attribute a de- fault value for the additional properties of the token Token. - The variable Exit refers to a procedure which is called upon an internal error in the scanner. The default procedure terminates the program execu- tion. The variable can be changed to achieve a different behaviour. - If the scanner reaches the end of the input it returns the special token called EofToken which is encoded by 0. Rex 20 5.2.2. Source Interface The scanners generated by Rex need a source module for blocked input of characters. Rex can provide a prototype source module which reads from stan- dard input. It is contained in the files <Scanner>Source.md and <Scanner>Source.mi. The definition module in the file <Scanner>Source.md has the following contents: DEFINITION MODULE <Scanner>Source; FROM SYSTEM IMPORT ADDRESS; FROM System IMPORT tFile; PROCEDURE BeginSource (FileName: ARRAY OF CHAR): tFile; PROCEDURE GetLine (File: tFile; Buffer: ADDRESS; Size: CARDINAL): INTEGER; PROCEDURE CloseSource (File: tFile); END <Scanner>Source. - BeginSource is called from the scanner in order to open files or to ini- tialize any other source of input. If not called input is read from standard input. - GetLine is called to fill a buffer starting at address 'Buffer' with a block of maximal 'Size' characters. Lines are terminated by newline char- acters (ASCII = 12C). GetLine returns the number of characters transferred. Reasonable block sizes are between 128 and 2048 or the length of a line. Smaller block sizes - especially block size 1 - will drastically slow down the scanner. - CloseSource is called from the scanner at end of file respectively at end of input. It can be used to close files. The implementation module in the file <Scanner>Source.mi has the follow- ing contents: Rex 21 IMPLEMENTATION MODULE <Scanner>Source; FROM SYSTEM IMPORT ADDRESS; FROM System IMPORT tFile, OpenInput, Read, Close; PROCEDURE BeginSource (FileName: ARRAY OF CHAR): tFile; BEGIN RETURN OpenInput (FileName); END BeginSource; PROCEDURE GetLine (File: tFile; Buffer: ADDRESS; Size: CARDINAL): INTEGER; CONST IgnoreChar = ' '; VAR n : INTEGER; VAR BufferPtr: POINTER TO ARRAY [0..30000] OF CHAR; BEGIN (* # ifdef Dialog n := Read (File, Buffer, Size); (* Add dummy after newline character in order to supply a lookahead for rex. *) (* This way newline tokens are recognized without typing an extra line. *) BufferPtr := Buffer; IF (n > 0) AND (BufferPtr^[n - 1] = 012C) THEN BufferPtr^[n] := IgnoreChar; INC (n); END; RETURN n; # else *) RETURN Read (File, Buffer, Size); (* # endif *) END GetLine; PROCEDURE CloseSource (File: tFile); BEGIN Close (File); END CloseSource; END <Scanner>Source. The newline character may constitute a token of its own in applications such as dialog programs. Like for every other token, Rex needs at least a look-ahead of one character to recognize this token. Therefore the user has to type not only one extra character but a complete extra input line because usu- ally input is line buffered by the operating system. This behaviour is un- desirable. The problem can be solved by modifying the procedure GetLine in the file <Scanner>Source.mi. The variant in the comment (* # ifdef Dialog ... # else *) adds a dummy character after the newline character to serve as looka- head. The dummy character should be a character that is ignored such as e. g. a blank. 5.2.3. Scanner Driver To test a generated scanner a main program is necessary. Rex can provide a minimal main program in the file <Scanner>Drv.mi which can serve as test driver. It counts the tokens and looks like the following: MODULE <Scanner>Drv; FROM <Scanner> IMPORT BeginScanner, GetToken, GetWord, Attribute, EofToken, Rex 22 CloseScanner; FROM Strings IMPORT tString, WriteL; FROM IO IMPORT StdOutput, WriteI, WriteC, WriteNl, CloseIO; FROM Positions IMPORT WritePosition; FROM System IMPORT Exit; VAR Token : INTEGER; Word : tString; Debug : BOOLEAN; Count : INTEGER; BEGIN Debug := FALSE; Count := 0; BeginScanner; REPEAT Token := GetToken (); INC (Count); IF Debug THEN GetWord (Word); WritePosition (StdOutput, Attribute.Position); WriteI (StdOutput, Token, 5); WriteC (StdOutput, ' '); WriteL (StdOutput, Word); END; UNTIL Token = EofToken; CloseScanner; WriteI (StdOutput, Count, 0); WriteNl (StdOutput); CloseIO; Exit (0); END <Scanner>Drv. 6. Usage NAME rex - generator of lexical analysers SYNOPSIS rex [ -options ] [ -ldir ] [ file ] DESCRIPTION Rex generates programs to be used in lexical analysis of text. A typical application is the generation of scanners for compilers. The input file contains regular expressions to be searched for, and actions written in C or Modula-2 to be executed when strings according to the expressions are found. Unrecognized portions of the input are copied to standard output. To be able to recognize tokens depending on their context, Rex provides start states to handle left context and the right context can be specified by an additional regular expression. If several regular expressions match the input characters, the longest match is preferred. If there are still several possibilities, the regular expression given first in the specifica- tion is chosen. Rex 23 Rex generated scanners automatically provide the line and column position of each token. For languages like Pascal and Ada where the case of letters is insignificant tokens can be normalized to lower or upper case. There are predefined rules to skip white space like blanks, tabs, or newlines and there is a mechanism to handle include files. The generated scanners are table-driven deterministic finite automatons. OPTIONS a generate all (= sdm) m generate a lexical analyser in Modula-2 (default) c generate a lexical analyser in C d generate a definition module for the lexical analyser s generate support modules: - a source module for input - a main program to be used as test driver r reduce the number of generated case/switch labels. Might be necessary due to compiler restrictions. Effects: slower scanner (2-4%), larger tables, same scanner size. i use ISO 8 bit code (default: ASCII 7 bit code) o optimize table size Effects: slower scanner (0-15%), small tables, long generation time (fac- tor 1-10) n do not optimize table size Effects: fast scanner, large tables (factor 1-10), short generation time default: improve table size Effects: slower scanner (0-5%), medium size tables (factor 1-2), medium generation time (factor 1-2) w suppress warnings g generate # line directives b do not partition charcater set into blocks 1 print statistics about the generated lexical analyser ldir dir is the directory where Rex finds its table and data files FILES if output is in C: <Scanner>.h specification of the generated scanner <Scanner>.c body of the generated scanner Rex 24 <Scanner>Source.h specification of support module source <Scanner>Source.c body of support module source <Scanner>Drv.c main program to serve as test driver if output is in Modula-2: <Scanner>.md definition module of the generated scanner <Scanner>.mi implementation module of the generated scanner <Scanner>Source.md definition module of support module source <Scanner>Source.mi implementation module of support module source <Scanner>Drv.mi main program to serve as test driver <Scanner>.Tab tables to control the generated scanner SEE ALSO J. Grosch: "Rex - A Scanner Generator", GMD Forschungsstelle an der Universitaet Karlsruhe, Compiler Generation Report No. 5, 1987 J. Grosch: "Efficient Generation of Lexical Analysers", Software - Prac- tice & Experience, 19 (11), 1089-1103, Nov. 1989 7. Implementation Rex is implemented by a 5,000 line Modula-2 program. The program makes heavy use of a library of reusable Modula-2 modules currently comprising 3,000 lines of code [Gro87]. Of the 5,000 lines of Rex 1,500 lines are generated by tools: - 500 lines for the scanner are generated by Rex itself. - 1000 lines for the parser are generated by the LALR(1) parser generator lalr. How can Rex generate a part of itself before its existence? Well, the scanner has been bootstrapped using LEX. The first version of the scanner was a separate C program generated by LEX which wrote the internal representation of the tokens on a file. A simple hand written scanner read the tokens from this file during construction of Rex. After Rex was operational it could gen- erate its own scanner in Modula-2. And how is Rex working? It differentiates between constant regular ex- pressions and non-constant ones as defined in [Gro89]. The non-constant regu- lar expressions constitute a nondeterministic finite automaton. The so-called subset construction algorithm is used for conversion into a deterministic fin- ite automaton. Then an algorithm to minimize the number of states is applied. After extending the automaton to a tunnel automaton the constant regular ex- pressions are added in linear time using the algorithm described in [Gro89]. The sparse matrix to control the automaton is compressed into a data structure called "comb vector" [ASU86] to save space. The key to the performance of scanners generated by Rex lies in the fol- lowing facts: - access to the "comb vector" table is fast Rex 25 - input happens rarely because blocks of characters are transferred - no check for the last character of a block is necessary because of the sentinel technique used - the same holds for the check of stack underflow for the stack to record the passed states - the treatment of right context is efficient and only necessary in a few cases because partial evaluation has been applied 8. Differences to LEX Some specialists might want to know about the differences between Rex and LEX [Les75] (see Table): Advantages of Rex: + The standard or initial start state has a documented name: STD. + The list of start states can be inverted using the operator NOT to speci- fy that a rule is valid in all states except the listed ones. + The specifications can be written unformatted - white space in the form of blanks, tabs, and newlines is skipped. + Identifiers used to refer to named regular expressions are written without enclosing braces '{' '}'. Table: Syntactical differences between Rex and LEX: Meaning LEX Rex ______________________________________________________________________________________ delimiter for character classes [ ] { } complement of character classes [^ ] - { } any character . ANY left justification ^ < right justification $ > replicator {n} [n] replicator {m,n} [m-n] delimiter for start states < > # # escape representation for characters \octal \decimal scanner routine yylex <Scanner>_GetToken access to matched string yytext <Scanner>_GetWord ( ) length of matched string yyleng result of <Scanner>_GetWord ( ) output of matched string ECHO yyEcho retain part of matched string yyless yyLess initial start state INITIAL STD change of start state BEGIN yyStart ( ) Rex 26 + Rex automatically calculates the source position of the tokens in the fields Line and Column of the variable <Scanner>_Attribute. + There are predefined rules to skip the white space characters. + Include files up to a nesting depth of 15 can be processed. + Routines are provided to normalize tokens to upper or lower case charac- ters. + No adjustment of the internal data structures are necessary to be able to process large specifications. Disadvantages of Rex: - The action statement yymore is not available. - The action statement REJECT is not available - Rex can only find one solution and not all like LEX. - The redirection of input with the procedure yywrap is not available. - The character set is fixed to the one of the host computer. There is no way to specify a different character set to be able to generate scanners for target computers with a different character set. Rex 27 Appendix 1: Syntax of the Specification Language specification : [name] [code] [define] [start] rules . name : SCANNER [Ident] . code : | code EXPORT TargetCode | code GLOBAL TargetCode | code LOCAL TargetCode | code BEGIN TargetCode | code CLOSE TargetCode | code DEFAULT TargetCode | code EOF TargetCode . define : DEFINE definition * . start : START identList . rules : RULE rule * | RULES rule * . definition : Ident '=' regExpr '.' . identList : Ident | identList Ident | identList ',' Ident . rule : patternList ':' TargetCode | patternList ':-' TargetCode . patternList : pattern | patternList ',' pattern . pattern : [startStates] ['<'] regExpr ['/' regExpr] ['>'] . startStates : '#' identList '#' | NOT '#' identList '#' . regExpr : regExpr '|' regExpr | regExpr regExpr | regExpr '+' | regExpr '*' | regExpr '?' | regExpr '[' Number ']' | regExpr '[' Number '-' Number ']' | '(' regExpr ')' | charSet | Char | Ident | String | Number Rex 28 . charSet : '-' charSet | '{' range * '}' . range : Char | Char '-' Char . Char : character | '\' digit + | '\' n | '\' t | '\' v | '\' b | '\' r | '\' f | '\' character . Ident : letter letter_or_digit * . letter_or_digit : letter | digit | '_' . String : '"' character * '"' . Number : digit + . Target_code : '{' character * '}' . Rex 29 Appendix 2: Example Specification of a Modula-2-Scanner in C GLOBAL { # include "Memory.h" # include "StringMem.h" # include "Idents.h" int level = 0; void ErrorAttribute (Token, Attribute) int Token; tScanAttribute Attribute; { } } LOCAL { char Word [256]; tIdent ident ; tStringRef ref ; int length ; } DEFAULT { printf ("illegal character: "); yyEcho; printf ("\n"); } DEFINE digit = {0-9} . letter = {a-z A-Z} . cmt = - {*(\t\n} . START comment RULE "(*" :- {++ level; yyStart (comment);} #comment# "*)" :- {-- level; if (level == 0) yyStart (STD);} #comment# "(" | "*" | cmt + :- {} /* The procedure PutString is imported from the module StringMem(ory). It is used to store the string representation of some tokens. */ #STD# digit + , #STD# digit + / ".." : {length = GetWord (Word); ref = PutString (Word, length); return 1;} #STD# {0-7} + B : {length = GetWord (Word); ref = PutString (Word, length); return 2;} #STD# {0-7} + C : {length = GetWord (Word); ref = PutString (Word, length); return 3;} #STD# digit {0-9 A-F} * H : { length = GetWord (Word); ref = PutString (Word, length); return 4;} Rex 30 #STD# digit + "." digit * (E {+\-} ? digit +) ? : { length = GetWord (Word); ref = PutString (Word, length); return 5;} #STD# ' - {\n'} * ' | \" - {\n"} * \" : {length = GetWord (Word); ref = PutString (Word, length); return 6;} #STD# "#" : {return 7;} #STD# "&" : {return 8;} #STD# "(" : {return 9;} #STD# ")" : {return 10;} #STD# "*" : {return 11;} #STD# "+" : {return 12;} #STD# "," : {return 13;} #STD# "-" : {return 14;} #STD# "." : {return 15;} #STD# ".." : {return 16;} #STD# "/" : {return 17;} #STD# ":" : {return 18;} #STD# ":=" : {return 19;} #STD# ";" : {return 20;} #STD# "<" : {return 21;} #STD# "<=" : {return 22;} #STD# "<>" : {return 23;} #STD# "=" : {return 24;} #STD# ">" : {return 25;} #STD# ">=" : {return 26;} #STD# "[" : {return 27;} #STD# "]" : {return 28;} #STD# "^" : {return 29;} #STD# "{" : {return 30;} #STD# "|" : {return 31;} #STD# "}" : {return 32;} #STD# "~" : {return 33;} #STD# AND : {return 34;} #STD# ARRAY : {return 35;} #STD# BEGIN : {return 36;} #STD# BY : {return 37;} #STD# CASE : {return 38;} #STD# CONST : {return 39;} #STD# DEFINITION : {return 40;} #STD# DIV : {return 41;} #STD# DO : {return 42;} #STD# ELSE : {return 43;} #STD# ELSIF : {return 44;} #STD# END : {return 45;} #STD# EXIT : {return 46;} #STD# EXPORT : {return 47;} #STD# FOR : {return 48;} #STD# FROM : {return 49;} Rex 31 #STD# IF : {return 50;} #STD# IMPLEMENTATION : {return 51;} #STD# IMPORT : {return 52;} #STD# IN : {return 53;} #STD# LOOP : {return 54;} #STD# MOD : {return 55;} #STD# MODULE : {return 56;} #STD# \NOT : {return 57;} #STD# OF : {return 58;} #STD# OR : {return 59;} #STD# POINTER : {return 60;} #STD# PROCEDURE : {return 61;} #STD# QUALIFIED : {return 62;} #STD# RECORD : {return 63;} #STD# REPEAT : {return 64;} #STD# RETURN : {return 65;} #STD# SET : {return 66;} #STD# THEN : {return 67;} #STD# TO : {return 68;} #STD# TYPE : {return 69;} #STD# UNTIL : {return 70;} #STD# VAR : {return 71;} #STD# WHILE : {return 72;} #STD# WITH : {return 73;} #STD# letter (letter | digit) * : { ident = MakeIdent (TokenPtr, TokenLength); return 74;} Rex 32 Appendix 3: Example Specification of a Modula-2-Scanner in Modula-2 GLOBAL { FROM Strings IMPORT tString ; FROM StringMem IMPORT tStringRef , PutString ; FROM Idents IMPORT tIdent , MakeIdent ; VAR level : CARDINAL; PROCEDURE ErrorAttribute (Token: INTEGER; VAR Attribute: tScanAttribute); BEGIN END ErrorAttribute; } LOCAL { VAR Word : tString; ident : tIdent; ref : tStringRef; } BEGIN { level := 0; } DEFAULT { IO.WriteS (IO.StdOutput, "illegal character: "); yyEcho; IO.WriteNl (IO.StdOutput); } DEFINE digit = {0-9} . letter = {a-z A-Z} . cmt = - {*(\t\n} . START comment RULE "(*" :- {INC (level); yyStart (comment);} #comment# "*)" :- {DEC (level); IF level = 0 THEN yyStart (STD); END;} #comment# "(" | "*" | cmt + :- {} #STD# digit + , #STD# digit + / ".." : {GetWord (Word); ref := PutString (Word); RETURN 1;} #STD# {0-7} + B : {GetWord (Word); ref := PutString (Word); RETURN 2;} #STD# {0-7} + C : {GetWord (Word); ref := PutString (Word); RETURN 3;} #STD# digit {0-9 A-F} * H : { GetWord (Word); ref := PutString (Word); RETURN 4;} #STD# digit + "." digit * (E {+\-} ? digit +) ? : { GetWord (Word); ref := PutString (Word); Rex 33 RETURN 5;} #STD# ' - {\n'} * ' | \" - {\n"} * \" : {GetWord (Word); ref := PutString (Word); RETURN 6;} #STD# "#" : {RETURN 7;} #STD# "&" : {RETURN 8;} #STD# "(" : {RETURN 9;} #STD# ")" : {RETURN 10;} #STD# "*" : {RETURN 11;} #STD# "+" : {RETURN 12;} #STD# "," : {RETURN 13;} #STD# "-" : {RETURN 14;} #STD# "." : {RETURN 15;} #STD# ".." : {RETURN 16;} #STD# "/" : {RETURN 17;} #STD# ":" : {RETURN 18;} #STD# ":=" : {RETURN 19;} #STD# ";" : {RETURN 20;} #STD# "<" : {RETURN 21;} #STD# "<=" : {RETURN 22;} #STD# "<>" : {RETURN 23;} #STD# "=" : {RETURN 24;} #STD# ">" : {RETURN 25;} #STD# ">=" : {RETURN 26;} #STD# "[" : {RETURN 27;} #STD# "]" : {RETURN 28;} #STD# "^" : {RETURN 29;} #STD# "{" : {RETURN 30;} #STD# "|" : {RETURN 31;} #STD# "}" : {RETURN 32;} #STD# "~" : {RETURN 33;} #STD# AND : {RETURN 34;} #STD# ARRAY : {RETURN 35;} #STD# BEGIN : {RETURN 36;} #STD# BY : {RETURN 37;} #STD# CASE : {RETURN 38;} #STD# CONST : {RETURN 39;} #STD# DEFINITION : {RETURN 40;} #STD# DIV : {RETURN 41;} #STD# DO : {RETURN 42;} #STD# ELSE : {RETURN 43;} #STD# ELSIF : {RETURN 44;} #STD# END : {RETURN 45;} #STD# EXIT : {RETURN 46;} #STD# EXPORT : {RETURN 47;} #STD# FOR : {RETURN 48;} #STD# FROM : {RETURN 49;} #STD# IF : {RETURN 50;} #STD# IMPLEMENTATION : {RETURN 51;} #STD# IMPORT : {RETURN 52;} Rex 34 #STD# IN : {RETURN 53;} #STD# LOOP : {RETURN 54;} #STD# MOD : {RETURN 55;} #STD# MODULE : {RETURN 56;} #STD# \NOT : {RETURN 57;} #STD# OF : {RETURN 58;} #STD# OR : {RETURN 59;} #STD# POINTER : {RETURN 60;} #STD# PROCEDURE : {RETURN 61;} #STD# QUALIFIED : {RETURN 62;} #STD# RECORD : {RETURN 63;} #STD# REPEAT : {RETURN 64;} #STD# RETURN : {RETURN 65;} #STD# SET : {RETURN 66;} #STD# THEN : {RETURN 67;} #STD# TO : {RETURN 68;} #STD# TYPE : {RETURN 69;} #STD# UNTIL : {RETURN 70;} #STD# VAR : {RETURN 71;} #STD# WHILE : {RETURN 72;} #STD# WITH : {RETURN 73;} #STD# letter (letter | digit) * : { GetWord (Word); ident := MakeIdent (Word); RETURN 74;} Rex 35 Appendix 4: Example Specification of a Scanner for Rex EXPORT { FROM Idents IMPORT tIdent ; FROM StringMem IMPORT tStringRef; FROM Texts IMPORT tText ; FROM Positions IMPORT tPosition; TYPE tScanAttribute = RECORD Position : tPosition ; CASE : INTEGER OF | 1: Ident : tIdent ; | 2: Number : SHORTCARD ; | 3: String : tStringRef ; | 4: Ch : CHAR ; | 5: Text : tText ; END; END; PROCEDURE ErrorAttribute (Token: INTEGER; VAR Attribute: tScanAttribute); } GLOBAL { FROM SYSTEM IMPORT ADDRESS; FROM Strings IMPORT tString, Concatenate, Char, SubString, StringToInt, AssignEmpty, Length; FROM Texts IMPORT MakeText, Append; FROM StringMem IMPORT tStringRef, PutString; FROM Idents IMPORT tIdent, MakeIdent, NoIdent; FROM Errors IMPORT ErrorMessage, Error; FROM ScanGen IMPORT Language, tLanguage; FROM Positions IMPORT tPosition; CONST SymIdent = 1 ; SymNumber = 2 ; SymString = 3 ; SymChar = 4 ; SymTargetcode = 5 ; SymScanner = 37 ; SymExport = 32 ; SymGlobal = 6 ; SymLocal = 31 ; SymBegin = 7 ; SymClose = 8 ; SymEof = 34 ; SymDefault = 36 ; SymDefine = 9 ; SymStart = 10 ; Rex 36 SymRules = 11 ; SymNot = 30 ; SymDot = 12 ; SymComma = 13 ; SymEqual = 14 ; SymColon = 15 ; SymColonMinus = 35 ; SymNrSign = 33 ; SymSlash = 16 ; SymBar = 17 ; SymPlus = 18 ; SymMinus = 19 ; SymAsterisk = 20 ; SymQuestion = 21 ; SymLParen = 22 ; SymRParen = 23 ; SymLBracket = 24 ; SymRBracket = 25 ; SymLBrace = 26 ; SymRBrace = 27 ; SymLess = 28 ; SymGreater = 29 ; BraceMissing = 13 ; UnclosedComment = 14 ; UnclosedString = 16 ; VAR level : INTEGER ; string : tString ; NoString : tStringRef ; Position : tPosition ; PROCEDURE ErrorAttribute (Token: INTEGER; VAR Attribute: tScanAttribute); BEGIN CASE Token OF | SymIdent : Attribute.Ident := NoIdent; | SymNumber : Attribute.Number := 0; | SymString : Attribute.String := NoString; | SymChar : Attribute.Ch := '?'; | SymTargetcode : MakeText (Attribute.Text); ELSE END; END ErrorAttribute; } LOCAL { VAR TargetCode, String, Word: tString; PrevState: SHORTCARD; } BEGIN { level := 0; AssignEmpty (string); NoString := PutString (string); } Rex 37 EOF { CASE yyStartState OF | targetcode , set : ErrorMessage (BraceMissing , Error, Attribute.Position); | comment : ErrorMessage (UnclosedComment , Error, Attribute.Position); | CStr1, CStr2, Str1, Str2 : ErrorMessage (UnclosedString , Error, Attribute.Position); ELSE END; yyStart (STD); } DEFINE letter = {A-Z a-z} . digit = {0-9} . string = - {"\n} . cmtch = - {*\t\n} . code = - {{\}\t\n\\'"} . StrCh1 = - {'\t\n} . StrCh2 = - {"\t\n} . CStrCh1 = - {'\t\n\\} . CStrCh2 = - {"\t\n\\} . START targetcode, set, rules, comment, Str1, Str2, CStr1, CStr2 RULES #targetcode# "{" : { IF level = 0 THEN MakeText (Attribute.Text); AssignEmpty (TargetCode); Position := Attribute.Position; ELSE GetWord (Word); Concatenate (TargetCode, Word); END; INC (level); } #targetcode# "}" :- { DEC (level); IF level = 0 THEN yyStart (PrevState); Append (Attribute.Text, TargetCode); Attribute.Position := Position; RETURN SymTargetcode; ELSE GetWord (Word); Concatenate (TargetCode, Word); END; } #targetcode# code + :- { Rex 38 IF level > 0 THEN GetWord (Word); Concatenate (TargetCode, Word); END; } #targetcode# \t :- { IF level > 0 THEN Strings.Append (TargetCode, 11C); END; yyTab; } #targetcode# \n :- { IF level > 0 THEN Append (Attribute.Text, TargetCode); AssignEmpty (TargetCode); END; yyEol (0); } #targetcode# \\ ANY :- { IF level > 0 THEN GetWord (Word); Strings.Append (TargetCode, Char (Word, 2)); END; } #targetcode# \\ :- { IF level > 0 THEN Strings.Append (TargetCode, '\'); END; } #targetcode# ' : { GetWord (String); IF Language = C THEN yyStart (CStr1); ELSE yyStart (Str1); END; } #targetcode# \" : { GetWord (String); IF Language = C THEN yyStart (CStr2); ELSE yyStart (Str2); END; } #Str1# StrCh1 + , #Str2# StrCh2 + , #CStr1# CStrCh1 + | \\ ANY ? , Rex 39 #CStr2# CStrCh2 + | \\ ANY ? :- {GetWord (Word); Concatenate (String, Word);} #CStr1# \\ \n , #CStr2# \\ \n :- {GetWord (Word); Concatenate (String, Word); yyEol (0);} #Str1, CStr1# ' , #Str2, CStr2# \" :- {Strings.Append (String, Char (String, 1)); yyPrevious; Concatenate (TargetCode, String); } #Str1, Str2, CStr1, CStr2# \t :- {Strings.Append (String, 11C); yyTab;} #Str1, Str2, CStr1, CStr2# \n :- { ErrorMessage (UnclosedString, Error, Attribute.Position); Strings.Append (String, Char (String, 1)); yyEol (0); yyPrevious; Concatenate (TargetCode, String); } #STD, rules# "/*" :- {yyStart (comment) ;} #comment# "*" | cmtch + :- {} #comment# "*/" :- {yyPrevious ;} #STD# EXPORT : {PrevState := STD; yyStart (targetcode); RETURN SymExport ;} #STD# GLOBAL : {PrevState := STD; yyStart (targetcode); RETURN SymGlobal ;} #STD# LOCAL : {PrevState := STD; yyStart (targetcode); RETURN SymLocal ;} #STD# BEGIN : {PrevState := STD; yyStart (targetcode); RETURN SymBegin ;} #STD# CLOSE : {PrevState := STD; yyStart (targetcode); RETURN SymClose ;} #STD# DEFAULT : {PrevState := STD; yyStart (targetcode); RETURN SymDefault ;} #STD# EOF : {PrevState := STD; yyStart (targetcode); RETURN SymEof ;} #STD# SCANNER : {RETURN SymScanner ;} #STD# DEFINE : {RETURN SymDefine ;} #STD# START : {RETURN SymStart ;} #STD# RULE S ?: {yyStart (rules); RETURN SymRules ;} #rules# \NOT : {RETURN SymNot ;} #STD, rules# letter (letter | digit | _) * : { GetWord (Word); Attribute.Ident := MakeIdent (Word); RETURN SymIdent; } #STD, rules# digit + : { GetWord (Word); Attribute.Number := StringToInt (Word); RETURN SymNumber; } Rex 40 #STD, rules# \" string * \" : { GetWord (Word); SubString (Word, 2, Length (Word) - 1, TargetCode); Attribute.String := PutString (TargetCode); RETURN SymString; } #STD# "." : {RETURN SymDot ;} #STD# "=" : {RETURN SymEqual ;} #STD, set# "}" : {yyPrevious; RETURN SymRBrace ;} #STD, set, rules# "-" : {RETURN SymMinus ;} #STD, rules# "," : {RETURN SymComma ;} #STD, rules# "|" : {RETURN SymBar ;} #STD, rules# "+" : {RETURN SymPlus ;} #STD, rules# "*" : {RETURN SymAsterisk ;} #STD, rules# "?" : {RETURN SymQuestion ;} #STD, rules# "(" : {RETURN SymLParen ;} #STD, rules# ")" : {RETURN SymRParen ;} #STD, rules# "[" : {RETURN SymLBracket ;} #STD, rules# "]" : {RETURN SymRBracket ;} #STD, rules# "{" : {yyStart (set); RETURN SymLBrace ;} #rules# "#" : {RETURN SymNrSign ;} #rules# "/" : {RETURN SymSlash ;} #rules# "<" : {RETURN SymLess ;} #rules# ">" : {RETURN SymGreater ;} #rules# ":" : {PrevState := rules; yyStart (targetcode); RETURN SymColon ;} #rules# ":-" : {PrevState := rules; yyStart (targetcode); RETURN SymColonMinus ;} #STD, set, rules# \\ n : {Attribute.Ch := 012C; RETURN SymChar;} #STD, set, rules# \\ t : {Attribute.Ch := 011C; RETURN SymChar;} #STD, set, rules# \\ v : {Attribute.Ch := 013C; RETURN SymChar;} #STD, set, rules# \\ b : {Attribute.Ch := 010C; RETURN SymChar;} #STD, set, rules# \\ r : {Attribute.Ch := 015C; RETURN SymChar;} #STD, set, rules# \\ f : {Attribute.Ch := 014C; RETURN SymChar;} #STD, set, rules# \\ digit + : { GetWord (Word); SubString (Word, 2, Length (Word), TargetCode); Attribute.Ch := CHR (CARDINAL (StringToInt (TargetCode))); RETURN SymChar; } #STD, set, rules# \\ ANY : { GetWord (Word); Attribute.Ch := Char (Word, 2); RETURN SymChar; } #STD, set, rules# - {\t\n\ \f\r} : { GetWord (Word); Attribute.Ch := Char (Word, 1); Rex 41 RETURN SymChar; } \f :- {} \r :- {} References [ASU86] A. V. Aho, R. Sethi and J. D. Ullman, Compilers: Principles, Techniques, and Tools, Addison Wesley, Reading, MA, 1986. [Gro87] J. Grosch, Reusable Software - A Collection of Modula-Modules, Compiler Generation Report No. 4, GMD Forschungsstelle an der Universitat Karlsruhe, Sep. 1987. [Gro89] J. Grosch, Efficient Generation of Lexical Analysers, Software-Practice & Experience 19, 11 (Nov. 1989), 1089-1103. [Les75] M. E. Lesk, LEX - A Lexical Analyzer Generator, Computing Science Technical Report 39, Bell Telephone Laboratories, Murray Hill, NJ, 1975. Rex 1 Contents 1. Introduction .................................................... 1 2. Overview ........................................................ 1 3. Specification Language .......................................... 2 3.1. Lexical Conventions ............................................. 2 3.2. Regular Expressions ............................................. 4 3.3. Ambiguous Specifications ........................................ 7 3.4. Definitions ..................................................... 8 3.5. Start States .................................................... 8 3.6. Scanner Name .................................................... 9 3.7. Target Code ..................................................... 9 3.8. Source Position ................................................. 12 4. Predefined Items ................................................ 12 4.1. Definitions ..................................................... 12 4.2. Start States .................................................... 13 4.3. Rules ........................................................... 13 4.4. Action Statements ............................................... 13 5. Interface of the Generated Scanners ............................. 14 5.1. C ............................................................... 14 5.1.1. Scanner Interface ............................................... 14 5.1.2. Source Interface ................................................ 16 5.1.3. Scanner Driver .................................................. 17 5.2. Modula-2 ........................................................ 18 5.2.1. Scanner Interface ............................................... 18 5.2.2. Source Interface ................................................ 20 5.2.3. Scanner Driver .................................................. 21 6. Usage ........................................................... 22 7. Implementation .................................................. 24 8. Differences to LEX .............................................. 25 Appendix 1: Syntax of the Specification Language ................ 27 Appendix 2: Example Specification of a Modula-2-Scanner in C .................................................................... 29 Appendix 3: Example Specification of a Modula-2-Scanner in Modula-2 ........................................................... 32 Appendix 4: Example Specification of a Scanner for Rex .......... 35 References ...................................................... 41